Hosted by the IIGB Bioinformatics Core
Brandon Le
Genomics Building 1207G
October 18-19, 2023
RNA sequencing or RNA-seq is one of many methods used for gene expression studies by obtaining a snapshot of the RNA molecules within a biological system.
Van den Berge et. al (2019) Annu Rev Biomed Data Sci
Next-generation sequencing (NGS) technologies
Sequencing resolution
You should receive an email with login information if you’ve never had an account on the HPCC. For those with an account already, you can use the same credentials to log into the server.
Open up your terminal program (on your Mac or PC) and at the $ prompt, enter the following command (change ‘username’ to your account name)
We will clone the GitHub repository for this workshop. The repo is stored at:
Cloning the repository:
Update changes made to the repository:
The directory structure should look like this:
We will create a conda environment containing bioinformatics software packages not readily available on the cluster for data processing and analysis. This conda environment includes special R packages (e.g., EnhancedVolcano) and software (e.g.,multiqc).
To create the conda environment:
To use the conda environment:
Quality Control
Sequence Alignment
Quantification
Differential expression
Data Visualization
R and R packages
BioProject: PRJNA950346
GEO Series: GSE228555
Title: Transcriptome expression of WT and mir163 mutant
Organism: Arabidopsis thaliana
Ecotype: Col-0
Genotype(s): WT, miR163 mutant
Biological replicates: 3
Tissue: Seedlings
Library kit: NEBNext® Ultra™ RNA Library Prep Kit for Illumina (non-stranded)
Sequencing: Illumina paired-end 150bp
Fastqc and trim_galoreSTAR alignerSTARfeatureCountsDESeq2Generate a metadata.csv file containing information about the dataset. The metadata will be used for processing and differential expression analysis.
The metadata for this dataset are:
srr_id,ecotype,genotype,treatment,tissue,gsm_id,biorep,samplename
Note
An extensive metadata file describes the samples in detail. However, the minimum metadata file should contain the following:
Other metadata info can include:
srr_id,ecotype,genotype,treatment,tissue,biorep,samplename,fq1,fq2
SRR24016000,Col-0,miR163_mut,7-day-old seedlings without treatment,seedlings,1,mir163_1,raw/SRR24016000_1.fastq.gz,raw/SRR24016000_2.fastq.gz
SRR24016001,Col-0,miR163_mut,7-day-old seedlings without treatment,seedlings,2,mir163_2,raw/SRR24016001_1.fastq.gz,raw/SRR24016001_2.fastq.gz
SRR24016002,Col-0,miR163_mut,7-day-old seedlings without treatment,seedlings,3,mir163_3,raw/SRR24016002_1.fastq.gz,raw/SRR24016002_2.fastq.gz
SRR24016003,Col-0,wt,7-day-old seedlings without treatment,seedlings,1,wt_1,raw/SRR24016003_1.fastq.gz,raw/SRR24016003_2.fastq.gz
SRR24016004,Col-0,wt,7-day-old seedlings without treatment,seedlings,2,wt_2,raw/SRR24016004_1.fastq.gz,raw/SRR24016004_2.fastq.gz
SRR24016005,Col-0,wt,7-day-old seedlings without treatment,seedlings,3,wt_3,raw/SRR24016005_1.fastq.gz,raw/SRR24016005_2.fastq.gzThe metadata_sub1M.csv contains info referencing the sub-sampled dataset
srr_id,ecotype,genotype,treatment,tissue,biorep,samplename,fq1,fq2
SRR24016000_sub1M,Col-0,miR163_mut,7-day-old seedlings without treatment,seedlings,1,mir163_1,raw/SRR24016000_sub1M_1.fastq.gz,raw/SRR24016000_sub1M_2.fastq.gz
SRR24016001_sub1M,Col-0,miR163_mut,7-day-old seedlings without treatment,seedlings,2,mir163_2,raw/SRR24016001_sub1M_1.fastq.gz,raw/SRR24016001_sub1M_2.fastq.gz
SRR24016002_sub1M,Col-0,miR163_mut,7-day-old seedlings without treatment,seedlings,3,mir163_3,raw/SRR24016002_sub1M_1.fastq.gz,raw/SRR24016002_sub1M_2.fastq.gz
SRR24016003_sub1M,Col-0,wt,7-day-old seedlings without treatment,seedlings,1,wt_1,raw/SRR24016003_sub1M_1.fastq.gz,raw/SRR24016003_sub1M_2.fastq.gz
SRR24016004_sub1M,Col-0,wt,7-day-old seedlings without treatment,seedlings,2,wt_2,raw/SRR24016004_sub1M_1.fastq.gz,raw/SRR24016004_sub1M_2.fastq.gz
SRR24016005_sub1M,Col-0,wt,7-day-old seedlings without treatment,seedlings,3,wt_3,raw/SRR24016005_sub1M_1.fastq.gz,raw/SRR24016005_sub1M_2.fastq.gzFastq files contain the raw sequence and the base quality score. Each sequence is comprised of four lines.
The sequence header (starts with @) and contain identifiers for the read
The sequence
'+'
Base quality scores @VH01192:45:AAC7JMMM5:1:1101:19973:1000 1:N:0:AGTTCAGG+TCTGTTGG
NATGGGACAGACATGCTGGCGGCACTCACTCACTTGGGCGGCTTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACAGTTCAGGATCTCGTATTCCCTTTTTTTTTTGTAAATTTTTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
+
#-C;;CCCCC-C-CCCCCCCCC-CCCCCCCCCCCCCCCCCCC-CCCCCCCC-CC-CCCCCCCCC-CC-CCCCCCCCC;-C--CC-CCCC--CC-C--;---;C----;-;CC-------C-C-C-C--CCCC-;C;CCC-CCCCCCCCCCC
@VH01192:45:AAC7JMMM5:1:1101:20125:1000 1:N:0:AGTTCAGG+TCTGTTGG
NCCCAGCCCCAGCGACTCCTAATAAAGCATTTCAGCAAATAAAAAAAAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCAAGTCACAGTTCAGGATGTGGTTTTTGGTTTTTTTTTTTTTAAATTTTGGGGGGGGGGGGGGGGGGGG
+
#CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC--CCCCCCCCCCCCCCCC;CCCC-CCCCCCCCCC---C-C---C-CC-CCCCC;CC-CC-C--;C-CC-------C-C---C-C----CC;C-CCCC;C;CCCCCCCCCCCC
@VH01192:45:AAC7JMMM5:1:1101:21034:1000 1:N:0:AGTTCAGG+TCTGTTGG
NTTGCAATGCTCAATAAGTCTATTCCACCTCAGTGTCCTTTTTAAAGAGTTTTGGAAAAAAAAAAAAAAAAAAAGATCGTAAGAGCACACGTCTGAACTCCAGTCACAGTTCAGGTTGTGGGTTTTCGTGTTTTTGTTTTTATTTTTGGGG
+
#CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC-CCCCC;C;CCCCCCCCCCCCCCCCCCCCCCCCCC-CC;C-CC-C;CCCCCC;C;CC-CCCCC;;;CCC-;;C;-;-C---C-C;;;--C-CCCC---C;C-----C;--C-
@VH01192:45:AAC7JMMM5:1:1101:21488:1000 1:N:0:AGTTCAGG+TCTGTTGG
NCCTCAAAAAAAAAAAAAAAAAAAAAAAAAATTTGGTATGTGAAATTTTTTTAATACATTTAAATTTTATGTTTTTGTTTTTCTTTTTTTTTTTTTTAAATATTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
+
#CCCCCCCCCCCCCCCCCCCCCCCCCCCCC-;---C-C;-;------CCCCCC;--CC--C-C-;-C-C--;C;-----C-C--CCCC-C--;C-C---C-;C;-C-C;CCCCCCCCC;CCCCC-CCCCCCCCCCC;C;CCCCCCCC;CCC
@VH01192:45:AAC7JMMM5:1:1101:19973:1000 1:N:0:AGTTCAGG+TCTGTTGG
unique instrument id
run id
flowcell id
flowcell lane
tile number within the flowcell lane
x-coordinate of the cluster within the tile
y-coordinate of the cluster within the tile
Member of a read/mate pair, 1 or 2
Y if the read is filtered, N otherwise
0 when none of the control bits are on
Index sequenceWe will run QC using Fastqc and trim_galore.
trim_galore will also run fastqc on the trimmed dataset.
#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=04:00:00 # 4 hours
#SBATCH --job-name="fastqc"
#SBATCH --output=log/%x_%j.log
#SBATCH -p batch # This is the default partition, you can use any of the following; intel, batch, highmem, gpu
module load fastqc
# Load variables
SRR_ID=$1
IN_DIR=raw
OUT_DIR=analysis/fastqc
# Create analysis folder (if it doesn't exist)
[ ! -d "$OUT_DIR" ] && mkdir -p "$OUT_DIR"
# -t : threads
# --noextract : do not extract zip output files
fastqc -f fastq -t 4 --noextract -o $OUT_DIR $IN_DIR/${SRR_ID}*.fastq.gz
$ ls analysis/fastqc
SRR24016000_1_fastqc.html
SRR24016000_1_fastqc.zip
SRR24016000_2_fastqc.html
SRR24016000_2_fastqc.zip
SRR24016001_1_fastqc.html
SRR24016001_1_fastqc.zip
SRR24016001_2_fastqc.html
SRR24016001_2_fastqc.zip
SRR24016002_1_fastqc.html
SRR24016002_1_fastqc.zip
SRR24016002_2_fastqc.html
SRR24016002_2_fastqc.zip
SRR24016003_1_fastqc.html
SRR24016003_1_fastqc.zip
SRR24016003_2_fastqc.html
SRR24016003_2_fastqc.zip
SRR24016004_1_fastqc.html
SRR24016004_1_fastqc.zip
SRR24016004_2_fastqc.html
SRR24016004_2_fastqc.zip
SRR24016005_1_fastqc.html
SRR24016005_1_fastqc.zip
SRR24016005_2_fastqc.html
SRR24016005_2_fastqc.zip#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=06:00:00 # 6 hours
#SBATCH --job-name="trim_galore"
#SBATCH --output=log/%x_%j.log
#SBATCH -p batch # This is the default partition, you can use any of the following; intel, batch, highmem, gpu
################### Trim_galore.sh ###################
# Load conda environment
module load trim_galore
# Reading in variables
SAMPLENAME=$1
FQ1=$2
FQ2=$3
OUT_DIR=analysis/trim_galore
# Create analysis folder (if it doesn't exist)
[ ! -d "$OUT_DIR" ] && mkdir -p "$OUT_DIR"
# Running trim_galore
trim_galore --fastqc --gzip --trim-n -j 4 --paired -o $OUT_DIR --basename ${SAMPLENAME} ${FQ1} ${FQ2}
# Running trim_galore (single-end)
# trim_galore --fastqc --gzip --trim-n -j 4 -o $OUT_DIR --basename ${SAMPLENAME} ${FQ1}The ‘val_1’ and ‘val_2’ files are the validated files after trim_galore processing. These files will be the input for the STAR alignment.
$ ls analysis/trim_galore
mir163_1_val_1_fastqc.html
mir163_1_val_1_fastqc.zip
mir163_1_val_1.fq.gz
mir163_1_val_2_fastqc.html
mir163_1_val_2_fastqc.zip
mir163_1_val_2.fq.gz
mir163_2_val_1_fastqc.html
mir163_2_val_1_fastqc.zip
mir163_2_val_1.fq.gz
mir163_2_val_2_fastqc.html
mir163_2_val_2_fastqc.zip
mir163_2_val_2.fq.gz
mir163_3_val_1_fastqc.html
mir163_3_val_1_fastqc.zip
mir163_3_val_1.fq.gz
mir163_3_val_2_fastqc.html
mir163_3_val_2_fastqc.zip
mir163_3_val_2.fq.gz
SRR24016000_1.fastq.gz_trimming_report.txt
SRR24016000_2.fastq.gz_trimming_report.txt
SRR24016001_1.fastq.gz_trimming_report.txt
SRR24016001_2.fastq.gz_trimming_report.txt
SRR24016002_1.fastq.gz_trimming_report.txt
SRR24016002_2.fastq.gz_trimming_report.txt
SRR24016003_1.fastq.gz_trimming_report.txt
SRR24016003_2.fastq.gz_trimming_report.txt
SRR24016004_1.fastq.gz_trimming_report.txt
SRR24016004_2.fastq.gz_trimming_report.txt
SRR24016005_1.fastq.gz_trimming_report.txt
SRR24016005_2.fastq.gz_trimming_report.txt
wt_1_val_1_fastqc.html
wt_1_val_1_fastqc.zip
wt_1_val_1.fq.gz
wt_1_val_2_fastqc.html
wt_1_val_2_fastqc.zip
wt_1_val_2.fq.gz
wt_2_val_1_fastqc.html
wt_2_val_1_fastqc.zip
wt_2_val_1.fq.gz
wt_2_val_2_fastqc.html
wt_2_val_2_fastqc.zip
wt_2_val_2.fq.gz
wt_3_val_1_fastqc.html
wt_3_val_1_fastqc.zip
wt_3_val_1.fq.gz
wt_3_val_2_fastqc.html
wt_3_val_2_fastqc.zip
wt_3_val_2.fq.gzTo run the STAR aligner, we first need to create an index of the genome.
To generate an index, you’ll need the following files:
FASTA: Fasta file containing sequences of all the chromosomes/scaffolds for the genome
GTF/GFF: General Transfer Format (GTF) or General Feature Format (GFF)
Note
You only need to run the indexing step once. The indexed files can be used for alignments for all subsequent samples
Warning
For assembled genomes containing large number of scaffolds or large genomes, this process can take some time and may require adjusting the indexing parameters.
#!/bin/bash
#SBATCH --partition=batch
#SBATCH --cpus-per-task=32
#SBATCH --mem=100g
#SBATCH --time=1-23:00:00
#SBATCH --job-name="STAR-index"
#SBATCH --output=log/%x_%j.log
## Load environment
module load star
NAME=ara11
FASTA=genome/genome/TAIR10_Chr.all.fasta
GTF=genome/Araport11_GFF3_genes_transposons.201606.gtf
GFF=genome/genome/Araport11_GFF3_genes_transposons.201606.gff
IDX_DIR=index
PARAMS="--runThreadN 32
--runMode genomeGenerate
--genomeDir $IDX_DIR/$NAME
--genomeFastaFiles $FASTA
--sjdbGTFfile $GTF
--sjdbOverhang 99
--genomeSAindexNbases 12"
STAR $PARAMS>Chr1 CHROMOSOME dumped from ADB: Jun/20/09 14:53; last updated: 2009-02-02
CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAATCTTTAAATCCTACATCCAT
GAATCCCTAAATACCTAATTCCCTAAACCCGAAACCGGTTTCTCTGGTTGAAAATCATTGTGTATATAATGATAATTTT
ATCGTTTTTATGTAATTGCTTATTGTTGTGTGTAGATTTTTTAAAAATATCATTTGAGGTCAATACAAATCCTATTTCT
TGTGGTTTTCTTTCCTTCACTTAGCTATGGATGGTTTATCTTCATTTGTTATATTGGATACAAGCTTTGCTACGATCTA
CATTTGGGAATGTGAGTCTCTTATTGTAACCTTAGGGTTGGTTTATCTCAAGAATCTTATTAATTGTTTGGACTGTTTA
TGTTTGGACATTTATTGTCATTCTTACTCCTTTGTGGAAATGTTTGTTCTATCAATTTATCTTTTGTGGGAAAATTATT
TAGTTGTAGGGATGAAGTCTTTCTTCGTTGTTGTTACGCTTGTCATCTCATCTCTCAATGATATGGGATGGTCCTTTAGChr1 Araport11 5UTR 3631 3759 . + . transcript_id "AT1G01010.1"; gene_id "AT1G01010"
Chr1 Araport11 exon 3631 3913 . + . transcript_id "AT1G01010.1"; gene_id "AT1G01010"
Chr1 Araport11 start_codon 3760 3762 . + . transcript_id "AT1G01010.1"; gene_id "AT1G01010"
Chr1 Araport11 CDS 3760 3913 . + 0 transcript_id "AT1G01010.1"; gene_id "AT1G01010"
Chr1 Araport11 exon 3996 4276 . + . transcript_id "AT1G01010.1"; gene_id "AT1G01010"
Chr1 Araport11 CDS 3996 4276 . + 2 transcript_id "AT1G01010.1"; gene_id "AT1G01010"
Chr1 Araport11 exon 4486 4605 . + . transcript_id "AT1G01010.1"; gene_id "AT1G01010"
Chr1 Araport11 CDS 4486 4605 . + 0 transcript_id "AT1G01010.1"; gene_id "AT1G01010"
Chr1 Araport11 exon 4706 5095 . + . transcript_id "AT1G01010.1"; gene_id "AT1G01010"
Chr1 Araport11 CDS 4706 5095 . + 0 transcript_id "AT1G01010.1"; gene_id "AT1G01010"
Chr1 Araport11 exon 5174 5326 . + . transcript_id "AT1G01010.1"; gene_id "AT1G01010"
Chr1 Araport11 CDS 5174 5326 . + 0 transcript_id "AT1G01010.1"; gene_id "AT1G01010"
Chr1 Araport11 CDS 5439 5630 . + 0 transcript_id "AT1G01010.1"; gene_id "AT1G01010"
Chr1 Araport11 exon 5439 5899 . + . transcript_id "AT1G01010.1"; gene_id "AT1G01010"
Chr1 Araport11 stop_codon 5628 5630 . + . transcript_id "AT1G01010.1"; gene_id "AT1G01010"
Chr1 Araport11 3UTR 5631 5899 . + . transcript_id "AT1G01010.1"; gene_id "AT1G01010"
Chr1 Araport11 exon 6788 7069 . - . transcript_id "AT1G01020.2"; gene_id "AT1G01020"
Chr1 Araport11 3UTR 6788 7069 . - . transcript_id "AT1G01020.2"; gene_id "AT1G01020"
Chr1 Araport11 3UTR 7157 7314 . - . transcript_id "AT1G01020.2"; gene_id "AT1G01020"##gff-version 3
Chr1 Araport11 gene 3631 5899 . + . ID=AT1G01010;Name=AT1G01010;Note=NAC domain containing protein 1;symbol=NAC001;Alias=ANAC001,NAC domain containing protein 1;full_name=NAC domain containing protein 1;computational_description=NAC domain containing protein 1 (NAC001)%3B FUNCTIONS IN: sequence-specific DNA binding transcription factor activity%3B INVOLVED IN: multicellular organismal development%2C regulation of transcription%3B LOCATED IN: cellular_component unknown%3B EXPRESSED IN: 7 plant structures%3B EXPRESSED DURING: 4 anthesis%2C C globular stage%2C petal differentiation and expansion stage%3B CONTAINS InterPro DOMAIN/s: No apical meristem (NAM) protein (InterPro:IPR003441)%3B BEST Arabidopsis thaliana protein match is: NAC domain containing protein 69 (TAIR:AT4G01550.1)%3B Has 2503 Blast hits to 2496 proteins in 69 species: Archae - 0%3B Bacteria - 0%3B Metazoa - 0%3B Fungi - 0%3B Plants - 2502%3B Viruses - 0%3B Other Eukaryotes - 1 (source: NCBI BLink).;Dbxref=PMID:11118137,PMID:12820902,PMID:15029955,PMID:15010618,PMID:15108305,PMID:15173566,PMID:15282545,PMID:16504176,PMID:16547105,PMID:16524982,PMID:16720694,PMID:16552445,PMID:17339215,PMID:17448460,PMID:17447913,PMID:18650403,PMID:20736450,PMID:24377444,locus:2200935;locus_type=protein_coding
Chr1 Araport11 mRNA 3631 5899 . + . ID=AT1G01010.1;Parent=AT1G01010;Name=AT1G01010.1;Note=NAC domain containing protein 1;conf_class=2;symbol=NAC001;Alias=ANAC001,NAC domain containing protein 1;full_name=NAC domain containing protein 1;computational_description=NAC domain containing protein 1 (NAC001)%3B FUNCTIONS IN: sequence-specific DNA binding transcription factor activity%3B INVOLVED IN: multicellular organismal development%2C regulation of transcription%3B LOCATED IN: cellular_component unknown%3B EXPRESSED IN: 7 plant structures%3B EXPRESSED DURING: 4 anthesis%2C C globular stage%2C petal differentiation and expansion stage%3B CONTAINS InterPro DOMAIN/s: No apical meristem (NAM) protein (InterPro:IPR003441)%3B BEST Arabidopsis thaliana protein match is: NAC domain containing protein 69 (TAIR:AT4G01550.1)%3B Has 2503 Blast hits to 2496 proteins in 69 species: Archae - 0%3B Bacteria - 0%3B Metazoa - 0%3B Fungi - 0%3B Plants - 2502%3B Viruses - 0%3B Other Eukaryotes - 1 (source: NCBI BLink).;conf_rating=****;Dbxref=PMID:11118137,gene:2200934,UniProt:Q0WV96
Chr1 Araport11 five_prime_UTR 3631 3759 . + . ID=AT1G01010:five_prime_UTR:1;Parent=AT1G01010.1;Name=NAC001:five_prime_UTR:1
Chr1 Araport11 exon 3631 3913 . + . ID=AT1G01010:exon:1;Parent=AT1G01010.1;Name=NAC001:exon:1
Chr1 Araport11 CDS 3760 3913 . + 0 ID=AT1G01010:CDS:1;Parent=AT1G01010.1;Name=NAC001:CDS:1
Chr1 Araport11 exon 3996 4276 . + . ID=AT1G01010:exon:2;Parent=AT1G01010.1;Name=NAC001:exon:2
Chr1 Araport11 CDS 3996 4276 . + 2 ID=AT1G01010:CDS:2;Parent=AT1G01010.1;Name=NAC001:CDS:2
Chr1 Araport11 exon 4486 4605 . + . ID=AT1G01010:exon:3;Parent=AT1G01010.1;Name=NAC001:exon:3
Chr1 Araport11 CDS 4486 4605 . + 0 ID=AT1G01010:CDS:3;Parent=AT1G01010.1;Name=NAC001:CDS:3
Chr1 Araport11 exon 4706 5095 . + . ID=AT1G01010:exon:4;Parent=AT1G01010.1;Name=NAC001:exon:4
Chr1 Araport11 CDS 4706 5095 . + 0 ID=AT1G01010:CDS:4;Parent=AT1G01010.1;Name=NAC001:CDS:4
Chr1 Araport11 exon 5174 5326 . + . ID=AT1G01010:exon:5;Parent=AT1G01010.1;Name=NAC001:exon:5
Chr1 Araport11 CDS 5174 5326 . + 0 ID=AT1G01010:CDS:5;Parent=AT1G01010.1;Name=NAC001:CDS:5
Chr1 Araport11 CDS 5439 5630 . + 0 ID=AT1G01010:CDS:6;Parent=AT1G01010.1;Name=NAC001:CDS:6
Chr1 Araport11 exon 5439 5899 . + . ID=AT1G01010:exon:6;Parent=AT1G01010.1;Name=NAC001:exon:6
Chr1 Araport11 three_prime_UTR 5631 5899 . + . ID=AT1G01010:three_prime_UTR:1;Parent=AT1G01010.1;Name=NAC001:three_prime_UTR:1Now we can align the reads from each sample to the reference genome.
#!/bin/bash
#SBATCH --partition=batch
#SBATCH --cpus-per-task=16
#SBATCH --mem=100g
#SBATCH --time=0-8:00:00
#SBATCH --mail-type=ALL
#SBATCH --job-name="STAR-align"
#SBATCH --output=log/%x_%j.log
##################################################################
## Load STAR environment
module load star
## Setting variables
SAMPLENAME=$1
IN_DIR=analysis/trim_galore
OUT_DIR=analysis/star
IDX_DIR=index
# Create analysis folder (if it doesn't exist)
[ ! -d "$OUT_DIR" ] && mkdir -p "$OUT_DIR"
## Setting STAR parameters
PARAMS=" --runThreadN 16
--runMode alignReads
--genomeDir ${IDX_DIR}
--readFilesCommand zcat
--outSAMtype BAM SortedByCoordinate
--outFileNamePrefix ${OUT_DIR}/${SAMPLENAME}_
--outFilterMismatchNmax 0
--outFilterMultimapNmax 1
--quantMode GeneCounts
--outWigType wiggle"
## Running STAR
STAR $PARAMS --readFilesIn $IN_DIR/${SAMPLENAME}_val_1.fq.gz $IN_DIR/${SAMPLENAME}_val_2.fq.gz
## Indexing bam files
samtools index $OUT_DIR/${SAMPLENAME}_*bamwt_1_Aligned.sortedByCoord.out.bam
wt_1_Aligned.sortedByCoord.out.bam.bai
wt_1_Log.final.out
wt_1_Log.out
wt_1_Log.progress.out
wt_1_ReadsPerGene.out.tab
wt_1_Signal.UniqueMultiple.str1.out.wig
wt_1_Signal.UniqueMultiple.str2.out.wig
wt_1_Signal.Unique.str1.out.wig
wt_1_Signal.Unique.str2.out.wig
wt_1_SJ.out.tab Started job on | Oct 16 15:01:05
Started mapping on | Oct 16 15:01:06
Finished on | Oct 16 15:14:11
Mapping speed, Million of reads per hour | 94.20
Number of input reads | 20540486
Average input read length | 297
UNIQUE READS:
Uniquely mapped reads number | 19330715
Uniquely mapped reads % | 94.11%
Average mapped length | 291.75
Number of splices: Total | 18632044
Number of splices: Annotated (sjdb) | 0
Number of splices: GT/AG | 18418926
Number of splices: GC/AG | 175340
Number of splices: AT/AC | 8071
Number of splices: Non-canonical | 29707
Mismatch rate per base, % | 0.00%
Deletion rate per base | 0.00%
Deletion average length | 1.51
Insertion rate per base | 0.00%
Insertion average length | 1.22
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 0
% of reads mapped to multiple loci | 0.00%
Number of reads mapped to too many loci | 389809
% of reads mapped to too many loci | 1.90%
UNMAPPED READS:
Number of reads unmapped: too many mismatches | 0
% of reads unmapped: too many mismatches | 0.00%
Number of reads unmapped: too short | 819554
% of reads unmapped: too short | 3.99%
Number of reads unmapped: other | 408
% of reads unmapped: other | 0.00%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%Sequence alignment map (SAM) format official documentation is available here. The SAM file consists of a header and rows for each read. Each row contains 11 mandatory fields. BAM is the binary compressed version of SAM.
For decoding the SAM Flags, check out this website from the Broad Institute.
A00738:657:H72KHDSX7:1:2513:31566:1110 163 chr01 2960 255 150M = 3065 254 CACCCACAGGCACCACCGTCCTTGTTGGTAATGAAGAAGACGAGACGACGACTTCCCCACTAGGAAACACGACGGAGGCGGAGATGATCGACGGCGGAGAGAGCTACAGAAACATCGATGCCTCCTGTCCAATCCCCCCATCCCATTCGG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:297 nM:i:0
A00738:657:H72KHDSX7:1:1330:26928:13870 99 chr01 2983 255 149M = 3071 223 GTTGGTAATGAAGAAGACGAGACGACGACTTCCCCACTAGGAAACACGACGGAGGCGGAGATGATCGACGGCGGAGAGAGCTACAGAAACATCGATGCCTCCTGTCCAATCCCCCCATCCCATTCGGTAGTTGGATTGAAGACTACCGA FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFF NH:i:1 HI:i:1 AS:i:282 nM:i:0
A00738:657:H72KHDSX7:1:1159:8594:10551 163 chr01 3001 255 150M = 3069 218 GAGACGACGACTTCCCCACTAGGAAACACGACGGAGGCGGAGATGATCGACGGCGGAGAGAGCTACAGAAACATCGATGCCTCCTGTCCAATCCCCCCATCCCATTCGGTAGTTGGATTGAAGACTACCGAATAAGAGAAGCAGGCAGGC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:298 nM:i:0
A00738:657:H72KHDSX7:1:1640:6551:15311 163 chr01 3006 255 149M = 3088 232 GACGACTTCCCCACTAGGAAACACGACGGAGGCGGAGATGATCGACGGCGGAGAGAGCTACAGAAACATCGATGCCTCCTGTCCAATCCCCCCATCCCATTCGGTAGTTGGATTGAAGACTACCGAATAAGAGAAGCAGGCAGGCAGAC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:297 nM:i:0
A00738:657:H72KHDSX7:1:2574:25084:30937 99 chr01 3007 255 149M = 3207 448 ACGACTTCCCCACTAGGAAACACGACGGAGGCGGAGATGATCGACGGCGGAGAGAGCTACAGAAACATCGATGCCTCCTGTCCAATCCCCCCATCCCATTCGGTAGTTGGATTGAAGACTACCGAATAAGAGAAGCAGGCAGGCAGACA FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFF:FF NH:i:1 HI:i:1 AS:i:297 nM:i:0
A00738:657:H72KHDSX7:1:1327:12717:27023 99 chr01 3008 255 149M = 3086 227 CGACTTCCCCACTAGGAAACACGACGGAGGCGGAGATGATCGACGGCGGAGAGAGCTACAGAAACATCGATGCCTCCTGTCCAATCCCCCCATCCCATTCGGTAGTTGGATTGAAGACTACCGAATAAGAGAAGCAGGCAGGCAGACAA FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:296 nM:i:0A00738:657:H72KHDSX7:1:1327:12717:27023
99
chr01
3008
255
149M
=
3086
227
CGACTTCCCCACTAGGAAACACGACGGAGGCGGAGATGATCGACGGCGGAGAGAGCTACAGAAACATCGATGCCTCCTGTCCAATCCCCCCATCCCATTCGGTAGTTGGATTGAAGACTACCGAATAAGAGAAGCAGGCAGGCAGACAA
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFF
NH:i:1 HI:i:1 AS:i:296 nM:i:0The BAM file produced from STAR contains alignment information that can be viewed in IGV. However, this file can be big and sluggish so one option is to convert the BAM file to a smaller BigWig format that will allow fast visualization of the results.
Note
The BigWig format will aggregate and bin your data across the genome. Therefore, you will lose individual read information. For observing SNPs, you’ll want to use the BAM files instead.
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=100G
#SBATCH --time=02:00:00
#SBATCH --job-name="bam2bw"
#SBATCH --output=log/%x_%j.log
#SBATCH -p batch # This is the default partition, you can use any of the following; intel, batch, highmem, gpu
# Shell script to generate BigWig files from STAR bam output
# Since the dataset is non-stranded, we can convert directly from the BAM file
# Load deeptools
module load deeptools
# Set variables
SAMPLENAME=$1
IN_DIR=analysis/star
bamCoverage -b $IN_DIR/${SAMPLENAME}_Aligned.sortedByCoord.out.bam \
-o $IN_DIR/${SAMPLENAME}.bw \
-p 4 \
--normalizeUsing CPM \
--exactScaling \
--skipNAsFastqc and trim_galoreSTAR alignerSTARfeatureCountsDESeq2After aligning the reads to the genome, we must quantify total reads associated with each genome feature (e.g., gene).
#!/bin/bash
#SBATCH --partition=batch
#SBATCH --cpus-per-task=2
#SBATCH --mem=12g
#SBATCH --time=0-4:00:00
#SBATCH --mail-type=ALL
#SBATCH --job-name="featurecount"
#SBATCH --output=log/%x_%j.log
##################################################################
## Shell script to quantify read counts with FeatureCounts
## Load modules
module load subread # v2.0.3
## Setting variables
IN_DIR=analysis/star
OUT_DIR=analysis/featurecounts
STRAND=0 # (0-unstrand, 1-strand, 2-reverse strand)
GTF=genome/Araport11_GFF3_genes_transposons.201606.gtf
# Create analysis folder (if it doesn't exist)
[ ! -d "$OUT_DIR" ] && mkdir -p "$OUT_DIR"
## Set featuercounts params
# -p : indicates paired-end reads
# --countReadPairs : count read pairs instead of reads
# --primary : count primary alignments only
PARAMS="-T 2 \
-s $STRAND \
-a $GTF \
-p \
-t exon \
-g gene_id \
--countReadPairs \
--primary \
-o ${OUT_DIR}/featurecounts.txt"
## Running featureCounts
featureCounts ${PARAMS} $IN_DIR/*bam #Program:featureCounts v2.0.6; Command:"featureCounts" "-T" "2" "-s" "0" "-a" "genome/Araport11_GFF3_genes_transposons.201606.gtf" "-p" "-t" "exon" "-g" "gene_id" "--countReadPairs" "--primary" "-o" "analysis/featurecounts/featurecounts.txt" "analysis/star/mir163_1_Aligned.sortedByCoord.out.bam" "analysis/star/mir163_2_Aligned.sortedByCoord.out.bam" "analysis/star/mir163_3_Aligned.sortedByCoord.out.bam" "analysis/star/wt_1_Aligned.sortedByCoord.out.bam" "analysis/star/wt_2_Aligned.sortedByCoord.out.bam" "analysis/star/wt_3_Aligned.sortedByCoord.out.bam"
Geneid Chr Start End Strand Length analysis/star/mir163_1_Aligned.sortedByCoord.out.bam analysis/star/mir163_2_Aligned.sortedByCoord.out.bam analysis/star/mir163_3_Aligned.sortedByCoord.out.bam analysis/star/wt_1_Aligned.sortedByCoord.out.bam analysis/star/wt_2_Aligned.sortedByCoord.out.bam analysis/star/wt_3_Aligned.sortedByCoord.out.bam
AT1G01010 Chr1;Chr1;Chr1;Chr1;Chr1;Chr1 3631;3996;4486;4706;5174;5439 3913;4276;4605;5095;5326;5899 +;+;+;+;+;+ 1688185 186 248 310 140 258
AT1G01020 Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1 6788;6788;6788;6788;6788;6788;7157;7157;7157;7157;7157;7157;7384;7384;7384;7384;7564;7564;7564;7564;7564;7564;7762;7762;7762;7762;7762;7942;7942;7942;7942;7942;8236;8236;8236;8236;8236;8236;8417;8417;8417;8417;8571;8571;8571;8594;8594;8594 7069;7069;7069;7069;7069;7069;7232;7232;7232;7232;7450;7450;7450;7450;7450;7450;7649;7649;7649;7649;7649;7649;7835;7835;7835;7835;7835;7987;7987;7987;7987;7987;8464;8325;8464;8325;8325;8325;8464;8464;8464;8464;9130;9130;8737;9130;9130;8737 -;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;- 1571 294 265 413 458 293334
AT1G03987 Chr1 11101 11372 + 272 23 15 17 29 16 20
AT1G01030 Chr1;Chr1;Chr1;Chr1;Chr1 11649;11649;12424;13335;13335 12354;13173;13173;13714;13714 -;-;-;-;- 1905188 216 234 262 192 205
AT1G01040 Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1 23121;23416;24542;24542;24752;24752;25041;25041;25524;25524;25825;25825;26081;26081;26292;26292;26543;26543;26862;26862;27099;27099;27372;27372;27618;27618;27803;27803;28708;28708;28890;28890;29160;29160;30147;30147;30410;30410;30902;30902 24451;24451;24655;24655;24962;24962;25435;25435;25743;25743;25997;25997;26203;26203;26452;26452;26776;26776;27012;27012;27281;27281;27536;27533;27713;27713;28431;28431;28805;28805;29080;29080;30065;30065;30311;30311;30816;30816;31120;31227 +;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+ 6279 1066 1034 14131462 864 1138
AT1G03993 Chr1 23312 24099 - 788 0 0 00 0 0
AT1G01050 Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1 31170;31382;31521;31521;31693;31693;31933;31933;32088;32088;32282;32282;32431;32431;32547;32547;32839;33029 31424;31424;31602;31602;31813;31813;31998;31998;32195;32195;32347;32347;32459;32459;32670;32670;33009;33171 -;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;- 1165 999 1218 1258 1130 11611269
AT1G03997 Chr1 32727 33009 + 283 0 0 00 0 0DESeq2 is an R package that allows differential expression analysis.
#!/bin/bash
#SBATCH --partition=batch
#SBATCH --cpus-per-task=16
#SBATCH --mem=100g
#SBATCH --time=0-8:00:00
#SBATCH --mail-type=ALL
#SBATCH --job-name="deseq2"
#SBATCH --output=log/%x_%j.log
##################################################################
####################
# Executing Rscript
# conda activate DEG-analysis
# Rscript --vanilla DESeq2.R
####################
# Load conda environment
source activate DEG-analysis
# Run R script
Rscript --vanilla code/DESeq2.R#!/usr/bin/env Rscript
# Load libaries
library(tidyverse)
library(DESeq2)
library(RColorBrewer)
library(gplots)
library(EnhancedVolcano)
library(ggfortify)
library(glue)
library(plotly)
# Load count matrix file from featurecounts and clean up the header
counts <- read_tsv("analysis/featurecounts/featurecounts.txt",
col_names=TRUE, skip=1)
new_name <- str_extract(names(counts)[endsWith(names(counts), "bam")],
"(?<=star/).*(?=_Aligned)")
names(counts)[endsWith(names(counts), "bam")] <- new_name
counts <- counts %>%
select(-Chr, -Start, -End, -Strand, -Length) %>%
column_to_rownames(var = "Geneid") %>%
as.matrix()
# Load sample info and metadata
sampleinfo <- read_csv("raw/metadata.csv", col_names=TRUE)
sampleinfo <- sampleinfo %>%
mutate(genotype = factor(genotype, levels = c("wt", "miR163_mut"))) %>%
arrange(samplename) %>%
column_to_rownames(var = "samplename") %>%
as.matrix()
write.table(as.data.frame(counts),
file = "analysis/featurecounts/count_matrix.csv",
sep = ",",
quote = FALSE,
row.names = F)
# Load annotation file
annot <- read_csv("genome/ath_gene_annot.csv", col_names=TRUE)
# Construct DESeqDataSet
dds <- DESeqDataSetFromMatrix(countData = counts,
colData = sampleinfo,
design = ~ genotype)
# Reorder factor level
dds$genotype<- relevel(dds$genotype, ref = "wt")
# Pre-filter low counts
keep <- rowSums(counts(dds)) >= 5
dds <- dds[keep,]
dds <- estimateSizeFactors(dds)
normalized.counts <- as.data.frame(counts(dds, normalized=TRUE)) %>%
as_tibble(rownames = "locus") %>%
left_join(., annot, c('locus'='tair_locus'))
# write to file
write.table(normalized.counts,"analysis/deseq2/genelists/DESeq2_normalizedcounts.txt",
sep = "\t",
quote = FALSE,
row.names = FALSE)
############## Pre-Processing Quality Assessment (START) ##############
## Library sizes bar plot
librarySizes <- colSums(counts)
pdf("analysis/deseq2/plots/lib-size.barplots.pdf", width = 7, height = 7)
barplot(librarySizes,
names=names(librarySizes),
las=2,
main="Barplot of library sizes")
abline(h=20e6, lty=2)
dev.off()
png("analysis/deseq2/plots/lib-size.barplots.png")
barplot(librarySizes,
names=names(librarySizes),
las=2,
main="Barplot of library sizes")
abline(h=20e6, lty=2)
dev.off()
## Count distribution Boxplots
logcounts <- log2(counts + 1) # no normalization
title_boxplot <- paste("Count Distribution Boxplots")
pdf("analysis/deseq2/plots/count-dist.boxplots.raw.pdf", width = 7, height = 7)
boxplot(logcounts,
main=title_boxplot,
xlab="",
ylab="Log2(Counts+1)",
las=2,
col=c(rep("red",3), rep("blue",3)))
abline(h=median(as.matrix(logcounts)), col="blue")
dev.off()
png("analysis/deseq2/plots/count-dist.boxplots.raw.png")
boxplot(logcounts,
main=title_boxplot,
xlab="",
ylab="Log2(Counts+1)",
las=2,
col=c(rep("red",3),rep("blue",3)))
abline(h=median(as.matrix(logcounts)), col="blue")
dev.off()
rlogcounts <- rlog(counts) #normalization
title_boxplot_rlog <- paste("Count Distribution Boxplots (rlog)")
pdf("analysis/deseq2/plots/count-dist.boxplots.rlog.pdf", width = 7, height = 7)
boxplot(rlogcounts,
main=title_boxplot_rlog,
xlab="",
ylab="rlog(Counts)",
las=2,
col=c(rep("red",3),rep("blue",3)))
dev.off()
png("analysis/deseq2/plots/count-dist.boxplots.rlog.png")
boxplot(rlogcounts,
main=title_boxplot_rlog,
xlab="",
ylab="rlog(Counts)",
las=2,
col=c(rep("red",3),rep("blue",3)))
dev.off()
## Correlation heatmaps
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
rld <- rlog(dds, blind=TRUE)
corr_samps <- cor(as.matrix(assay(rld)))
pdf("analysis/deseq2/plots/cor-heatmap.pdf", width = 7, height = 7)
heatmap.2(corr_samps,
trace = "none",
col = colors,
main = "Sample correlations")
dev.off()
png("analysis/deseq2/plots/cor-heatmap.png")
heatmap.2(corr_samps,
trace = "none",
col = colors,
main = "Sample correlations")
dev.off()
## Principal Component Analysis
# autoplot requires the ggfortify library
pcDat <- prcomp(t(rlogcounts))
title_pca <- paste("PC1 vs PC2, All Samples")
pdf("analysis/deseq2/plots/PCA.pdf", width = 7, height = 7)
autoplot(pcDat,
main=title_pca,
data = sampleinfo,
colour="genotype",
shape=FALSE)
dev.off()
png("analysis/deseq2/plots/PCA.png")
autoplot(pcDat,
main=title_pca,
data = sampleinfo,
colour="genotype",
shape=FALSE)
dev.off()
## Plotly version of PCA plot
p <- autoplot(pcDat,
main=title_pca,
data = sampleinfo,
colour = "genotype",
label = F,
label.repel = F,
size = 8,
group = "time",
shape = "time")
plt <- plotly::ggplotly(p)
htmlwidgets::saveWidget(plt, "analysis/deseq2/plots/PCA.html")
############## Pre-Processing Quality Assessment (END) ##############
############## Differential Expression (START)##############
# Calculate DE for all contrasts
dds <- DESeq(dds)
# Extract different contrast (i.e., comparisons)
contrast_list <- resultsNames(dds)
contrast_list <- contrast_list[contrast_list != "Intercept"] # remove Intercept
# Function to extract info for each contrast
deseq2_deg <- function(contrast_name){
# extracting DEG results
res <- results(dds, name=contrast_name)
res05 <- results(dds, name=contrast_name, lfcThreshold = 1, alpha = 0.05)
res05$padj <- p.adjust(res05$pvalue, method="BH")
res05 <- res05[!is.na(res05$padj),]
res_filt <- res[ which(res$padj < 0.1), ] # total DEGs
res05_filt <- res05[ which(res05$padj < 0.05), ] # total DEGs
up <- subset(res_filt, log2FoldChange > 0) # upDEGs (LFC > 0, p<0.1)
dn <- subset(res_filt, log2FoldChange < 0) # dnDEGs (LFC < 0, p<0.1)
up_05 <- subset(res05_filt, log2FoldChange > 1) # upDEGs (LFC > 1, p<0.05)
dn_05 <- subset(res05_filt, log2FoldChange < -1) # dnDEGs (LFC< -1, p<0.05)
# generates summary file
summary_file <- c(contrast_name,
nrow(res_filt),
nrow(up),
nrow(dn),
nrow(res05_filt),
nrow(up_05),
nrow(dn_05))
# export lists
degs <- subset(res, padj < 0.1) %>%
as_tibble(rownames = "locus") %>%
left_join(., annot, c('locus'='tair_locus'))
degs_05 <- subset(res05, padj < 0.05, log2FoldChange > 1) %>%
as_tibble(rownames = "locus") %>%
left_join(., annot, c('locus'='tair_locus'))
write.table(degs,
file = paste0("analysis/deseq2/genelists/",contrast_name,"_DEGs.LFC0_p1.txt"),
quote = FALSE,
sep= "\t",
row.names = FALSE)
write.table(degs_05,
file = paste0("analysis/deseq2/genelists/",contrast_name,"_DEGs.LFC1_p05.txt"),
quote = FALSE,
sep= "\t",
row.names = FALSE)
# post-processing plots
# volcano plots with EnhancedVolcano
subtitle1 <- paste("LFC > 0, FDR < 0.1")
subtitle2 <- paste("LFC > 1, FDR < 0.05")
EnhancedVolcano(res,
lab = rownames(res),
x = 'log2FoldChange',
y = 'pvalue',
title = contrast_name,
subtitle = subtitle1,
pCutoff = 0.1,
pCutoffCol = 'padj', # p-value
FCcutoff = 0,
pointSize = 4.0,
labSize = 6.0,
drawConnectors = TRUE,
widthConnectors = 0.75)
ggsave(paste0("analysis/deseq2/plots/",contrast_name,"-volcano-LFC0-FDR1.pdf"), width = 7, height = 7)
ggsave(paste0("analysis/deseq2/plots/",contrast_name,"-volcano-LFC0-FDR1.png"), scale = 2)
EnhancedVolcano(res05,
lab = rownames(res05),
x = 'log2FoldChange',
y = 'pvalue',
title = contrast_name,
subtitle = subtitle2,
pCutoff = 0.05,
pCutoffCol = 'padj', # p-value
FCcutoff = 1,
pointSize = 4.0,
labSize = 6.0,
drawConnectors = TRUE,
widthConnectors = 0.75)
ggsave(paste0("analysis/deseq2/plots/",contrast_name,"-volcano-LFC1-FDR05.pdf"), width = 7, height = 7)
ggsave(paste0("analysis/deseq2/plots/",contrast_name,"-volcano-LFC1-FDR05.png"), scale = 2)
return(summary_file)
}
# Run function for all contrast
sum_file <- sapply(contrast_list, deseq2_deg)
sum_file_tbl <- sum_file %>%
t() %>%
as_tibble() %>%
filter(V1 != "Intercept") %>%
select("Comparison"=V1,
"Total DEGs (p < 0.1)"=V2,
"Up DEGs (p < 0.1, LFC > 0)"=V3,
"Down DEGs (p < 0.1, LFC < 0)"=V4,
"Total DEGs (p < 0.05)"=V5,
"Up DEGs (p < 0.05, LFC > 1)"=V6,
"Down DEGs (p < 0.05, LFC < -1)"=V7) %>%
mutate(across(2:7, as.integer))
write.table(sum_file_tbl,
file = "analysis/deseq2/genelists/DESeq2.DEG_summary.txt",
sep = "\t",
row.names = FALSE,
quote = FALSE)The custom DESeq2 script generates several plots, tables, and genelists organized into the following directories.
Lets examine some of the pre-processing information about the dataset.
DEGs are defined using two filters:
$ls analysis/deseq2/genelists
DESeq2.DEG_summary.txt
DESeq2_normalizedcounts.txt
genotype_miR163_mut_vs_wt_DEGs.LFC0_p1.txt
genotype_miR163_mut_vs_wt_DEGs.LFC1_p05.txtMultiQC is used to generate a summary of the raw and processed data. The results are provided in an HTML file.
#!/bin/bash
#SBATCH --partition=batch
#SBATCH --cpus-per-task=1
#SBATCH --mem=12g
#SBATCH --time=0-4:00:00
#SBATCH --mail-type=ALL
#SBATCH --job-name="multiqc"
#SBATCH --output=log/%x_%j.log
##################################################################
## Shell script to generate MultiQC summary
## Load environment
source activate DEG-analysis
## Setting variables
OUT_DIR=analysis/multiqc
# Create analysis folder (if it doesn't exist)
[ ! -d "$OUT_DIR" ] && mkdir -p "$OUT_DIR"
# Run multiqc
multiqc --force \
--dirs \
--outdir ${OUT_DIR} \
--filename multiqc \
analysisTo inspect and examine your RNA-seq results, you can load the BigWig or BAM files into IGV.
Show desktop demo